In our project, we will attempt to build a regression model using best subset selection to analyze bike-sharing data to predict rental demand. We will examine factors like weather, time, and holidays to understand their influence on bike usage.
Bike-sharing systems are an integral part of urban transportation (Winters 2020). Understanding the factors that drive bike-share demand can help urban planners optimize services. The dataset we used to fit our regression is the Bike Sharing Dataset (dataset ID: 275) from the UCI Machine Learning Repository (Fanaee-T 2013). It contains information on bike rentals, weather conditions, and time-related features. Our research question: How many bikes will be rented on a day based on weather and temporal factors?
The R programming language (R Core Team 2024) and the following R packages were used to perform the analysis: knitr (Xie 2024), tidyverse (Wickham et al. 2023), tidymodels (Kuhn and Wickham 2025), ucimlrepo (Dua and Graff 2024), leaps (Lumley 2024), mltools (Sailo 2018), and ggpubr (Kassambara 2023).
| Predictor Variable | Description |
|---|---|
| season | Season that the bike is rented in |
| holiday | If the day the bike was rented is a holiday |
| workingday | If the day the bike was rented is a work day |
| weathersit | What the weather was on the day the bike was rented |
| temp | What the temperature was on the day the bike was rented |
| hum | What the humidity was on the day the bike was rented |
| windspeed | What the wind speed was on the day the bike was rented |
Our dataset was loaded and cleaned by ensuring that categorical variables were correctly encoded as factors and by removing irrelevant columns. The data contained no missing values or special characters, so no further cleaning was needed.
In our exploratory analysis, we looked at how the predictor variables relate to bike rental usage (Figure 1). We also made a correlation matrix (Figure 3) to explore how correlated our variables are. We found multicollinearity between atemp and temp, so moving forward we will use only temp in our analysis. Finally, we found that the distribution of bike rental counts was heavily right-skewed. Because we plan to use linear regression and want to maintain the assumption of normality, moving forward we will apply a log transformation to the cnt variable.
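The effect of a log transformation on a right-skewed count can be illustrated with a minimal sketch. It is written in Python for illustration only (the analysis itself was done in R), the counts are made up rather than taken from the dataset, and `skewness` is a hypothetical helper.

```python
import math
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Made-up right-skewed counts (illustrative only, not from the dataset).
counts = [5, 12, 20, 35, 60, 90, 140, 210, 400, 900]

# Natural log of the response, mirroring the transformation applied to cnt.
log_counts = [math.log(c) for c in counts]

raw_skew = skewness(counts)      # strongly positive for right-skewed data
log_skew = skewness(log_counts)  # much closer to zero after the transform
```

A skewness near zero on the log scale is what makes the normality assumption more plausible for the transformed response.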
The data was split into training (75%) and testing (25%) sets, stratified by cnt (total bike counts). To check that the test set was representative of the training set, we computed the median, mean, and standard deviation of cnt for both sets and confirmed that they were similar, as shown in Table 2.
Table 2: Median, mean, and standard deviation of bike rental counts (cnt) for each data split.
| partition | fraction | median_cnt | mean_cnt | sd_cnt |
|---|---|---|---|---|
| Train | 0.75 | 142 | 189.8049 | 181.6294 |
| Test | 0.25 | 142 | 188.4385 | 180.6777 |
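The split-and-compare check can be sketched as follows. This is a Python illustration under stated assumptions, not the R code used in the analysis: the counts are synthetic, stratification is done with four quantile bins (an assumption about how the strata were formed), and `stratified_split` and `summarize` are hypothetical helper names.

```python
import math
import random
import statistics

def stratified_split(values, train_frac=0.75, n_bins=4, seed=0):
    """Split values into train/test, drawing train_frac from each quantile bin."""
    rng = random.Random(seed)
    ordered = sorted(values)
    bin_size = math.ceil(len(ordered) / n_bins)
    train, test = [], []
    for start in range(0, len(ordered), bin_size):
        stratum = ordered[start:start + bin_size]
        rng.shuffle(stratum)
        cut = round(train_frac * len(stratum))
        train.extend(stratum[:cut])
        test.extend(stratum[cut:])
    return train, test

def summarize(name, part, total):
    """One row of the summary table for a partition."""
    return {
        "partition": name,
        "fraction": len(part) / total,
        "median_cnt": statistics.median(part),
        "mean_cnt": statistics.fmean(part),
        "sd_cnt": statistics.stdev(part),
    }

# Synthetic right-skewed counts (illustrative only, not the real data).
data_rng = random.Random(42)
counts = [int(data_rng.expovariate(1 / 190)) + 1 for _ in range(1000)]

train, test = stratified_split(counts)
summary = [
    summarize("Train", train, len(counts)),
    summarize("Test", test, len(counts)),
]
```

Because each bin contributes the same share to both partitions, the medians, means, and standard deviations of the two partitions should come out close, which is the pattern reported in the table.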
To determine the most appropriate model, we used best subset selection (Table 3). Because weather is a categorical variable, we also compared the model with and without weather to determine our final model.
| R2 | Adj.R2 |
|---|---|
| 7 | 7 |
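The idea behind best subset selection can be sketched in a few lines. The analysis used the leaps package in R; the Python below is only an illustrative simplification that fits ordinary least squares on every non-empty subset of predictors and keeps the subset with the highest adjusted R^2. The data are synthetic, and `solve`, `adjusted_r2`, and `best_subset` are hypothetical helpers.

```python
import random
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def adjusted_r2(rows, y, cols):
    """Fit OLS with an intercept on the chosen columns; return adjusted R^2."""
    x = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in x]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    n, p = len(y), len(cols)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)

def best_subset(rows, y, n_predictors):
    """Evaluate every non-empty subset; return the best by adjusted R^2."""
    candidates = (
        cols
        for size in range(1, n_predictors + 1)
        for cols in combinations(range(n_predictors), size)
    )
    return max(candidates, key=lambda cols: adjusted_r2(rows, y, cols))

# Synthetic data: the response depends on predictors 0 and 2 only.
rng = random.Random(1)
rows = [[rng.random() for _ in range(3)] for _ in range(200)]
y = [2.0 * r[0] - 3.0 * r[2] + rng.gauss(0, 0.1) for r in rows]

chosen = best_subset(rows, y, n_predictors=3)
```

With seven candidate predictors this exhaustive search covers 127 subsets, which is why best subset selection is practical here but becomes expensive as the number of predictors grows.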
We created two linear regression models, with and without weather respectively, to assess the impact of weather on bike demand. Because the model that included weather had a higher adjusted R^2 (Table 4), we chose it as our final regression model, shown in Table 5.
| Adj.R2_with | Adj.R2_without |
|---|---|
| 0.2603212 | 0.2578714 |
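For reference, the standard definition of adjusted R^2 is given below, where n is the number of observations and p is the number of predictors (excluding the intercept); these symbols are introduced here for illustration and do not appear in the original report.

```latex
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Because the penalty term grows with p, adding a predictor raises adjusted R^2 only if it improves the fit by more than would be expected by chance, which is why it is a reasonable criterion for comparing the models with and without weather.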
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.3201706 | 0.0627384 | 68.860083 | 0.0000000 |
| season | 0.1623146 | 0.0109425 | 14.833443 | 0.0000000 |
| holiday | -0.1972719 | 0.0682631 | -2.889874 | 0.0038603 |
| workingday | -0.0599671 | 0.0249967 | -2.399000 | 0.0164539 |
| weathersit | 0.1299597 | 0.0195614 | 6.643675 | 0.0000000 |
| temp | 2.5407403 | 0.0620367 | 40.955410 | 0.0000000 |
| hum | -2.6681251 | 0.0687305 | -38.820079 | 0.0000000 |
| windspeed | 0.4388724 | 0.0977785 | 4.488434 | 0.0000072 |
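Because the response is the log of cnt, each coefficient describes a multiplicative rather than an additive effect on rentals. The Python sketch below (illustrative only; the analysis was done in R) converts a few estimates from the table into approximate percent changes. It assumes the natural log was used, and `percent_change` is a hypothetical helper.

```python
import math

# Estimates copied from the coefficient table (response: log of cnt).
estimates = {
    "holiday": -0.1972719,
    "workingday": -0.0599671,
    "temp": 2.5407403,
    "hum": -2.6681251,
}

def percent_change(beta, delta=1.0):
    """Approximate % change in cnt for a `delta` increase in the predictor,
    holding the other predictors fixed (assumes a natural-log response)."""
    return (math.exp(beta * delta) - 1) * 100

# Effect of a holiday versus a non-holiday.
holiday_effect = percent_change(estimates["holiday"])

# Effect of a 0.1-unit increase in temp (in whatever units temp is stored).
temp_effect = percent_change(estimates["temp"], delta=0.1)
```

Under these assumptions, a negative estimate such as the one for holiday corresponds to a percentage drop in expected rentals, while the positive temp estimate corresponds to a percentage increase.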
To assess the model fit, we generated a residual plot (Figure 4). The plot indicates that, even with our log transformation, the residuals are somewhat heteroscedastic; in future iterations of this project we plan to adopt a different, more appropriate model.
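Finally, to evaluate prediction accuracy, we calculated the RMSE (Table 6), which we found to be approximately 1.27. Because the model was fit to log-transformed cnt, this value is presumably on the log scale rather than in units of rentals. We take this as an indication that the model is useful for prediction.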
| RMSE |
|---|
| 1.274948 |
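How an RMSE is computed, and how a log-scale RMSE can be read, is sketched below in Python (illustrative only; the analysis was done in R). The actual and predicted values are made up, `rmse` is a hypothetical helper, and the interpretation of the reported value assumes it was computed on the natural-log scale, which the report does not state explicitly.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up log-scale values (illustrative only, not model output).
log_actual = [4.0, 5.0, 6.0]
log_predicted = [4.5, 4.5, 6.5]
example_rmse = rmse(log_actual, log_predicted)

# If the reported RMSE is on the natural-log scale, exponentiating it gives
# a rough multiplicative size for a typical prediction error.
reported_rmse = 1.274948
typical_error_factor = math.exp(reported_rmse)
```

Reading the error as a multiplicative factor on the original count scale gives a more direct sense of prediction accuracy than the log-scale number alone.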
We found that the ideal model for our data includes season, holiday status, whether it is a working day, the weather situation, the temperature, the humidity, and the wind speed. We found that our model became stronger with the inclusion of weather-related variables. Though none of these findings is individually surprising, we were surprised that all of the variables had an impact on bike demand prediction, and we wonder whether more research can be done into what other variables could be used in this model. These findings suggest that these variables can significantly influence bike demand, information that can be used to help increase total users.
Future Questions:
Could a non-linear model be more accurate in terms of prediction?
How do long-term weather trends affect the seasonal bike usage?
What other outside variables are impactful in the prediction of bike-share usage?